Sublinear regret for learning POMDPs
نویسندگان
چکیده
We study the model-based undiscounted reinforcement learning for partially observable Markov decision processes (POMDPs). The oracle we consider is optimal policy of POMDP with a known environment in terms average reward over an infinite horizon. propose algorithm this problem, building on spectral method-of-moments estimations hidden models, belief error control POMDPs and upper confidence bound methods online learning. establish regret O ( T 2 / 3 log ) $O(T^{2/3}\sqrt {\log T})$ proposed where This is, to best our knowledge, first achieving sublinear respect general POMDPs.
منابع مشابه
Safe Policy Search for Lifelong Reinforcement Learning with Sublinear Regret
Lifelong reinforcement learning provides a promising framework for developing versatile agents that can accumulate knowledge over a lifetime of experience and rapidly learn new tasks by building upon prior knowledge. However, current lifelong learning methods exhibit non-vanishing regret as the amount of experience increases, and include limitations that can lead to suboptimal or unsafe control...
متن کاملAlgorithms with Logarithmic or Sublinear Regret for Constrained Contextual Bandits
We study contextual bandits with budget and time constraints under discrete contexts, referred to as constrained contextual bandits. The budget and time constraints significantly increase the complexity of exploration-exploitation tradeoff because they introduce coupling among contexts. Such coupling effects make it difficult to obtain oracle solutions that assume known statistics of bandits. T...
متن کاملRegret Bounds for Lifelong Learning
We consider the problem of transfer learning in an online setting. Di↵erent tasks are presented sequentially and processed by a within-task algorithm. We propose a lifelong learning strategy which refines the underlying data representation used by the withintask algorithm, thereby transferring information from one task to the next. We show that when the within-task algorithm comes with some reg...
متن کاملSublinear-Time Optimization for High-Dimensional Learning
Across domains, the scale of data and complexity of machine learning models have both been increasing greatly in recent years. For many such large scale models of interest, tractable estimation without access to extensive computational infrastructure is still an open problem. In this thesis, we approach the tractability of large-scale estimation problems through the lens of leveraging sparse st...
متن کاملScalable Planning and Learning for Multiagent POMDPs
Online, sample-based planning algorithms for POMDPs have shown great promise in scaling to problems with large state spaces, but they become intractable for large action and observation spaces. This is particularly problematic in multiagent POMDPs where the action and observation space grows exponentially with the number of agents. To combat this intractability, we propose a novel scalable appr...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
ژورنال
عنوان ژورنال: Production and Operations Management
سال: 2022
ISSN: ['1059-1478', '1937-5956']
DOI: https://doi.org/10.1111/poms.13778